class: center, middle, inverse, title-slide

.title[
# Introduction to Simple Linear Regression
]
.author[
### .font1[.]
]
.author[
### .font110[Zhaohu (Jonathan) Fan]
]
.author[
### .font70[Department of Operations, Business Analytics and Information Systems]
]
.author[
### .font70[Carl H. Lindner College of Business]
]
.author[
### .font70[University of Cincinnati]
]
.date[
### .font70[ January 27, 2023]
]

---
# Previous experiences with regression
</br>
</br>

--

.font140[
* I have never heard of regression/have heard of regression before, but have never used it.
]

--

.font140[
* I have used regression before in a class or at work.
]

---
# Main topics
</br>
</br>
.font140[
- Simple linear regression (SLR) model
]
.font140[
- Least squares estimation (LS)
]

--

.font140[
- Case study of Ames housing data: How linear regression plays a role in predicting house prices.
- R demonstration using Ames housing data
]

---
# Introduction
.font140[
* Managerial decisions are often based on the **relationship** between two or more variables.
]

--

.font140[
* **Regression in business**
    - .blue[Predict/Estimate] sales based on:
        * Advertising expenditures
]

--

.font140[
* If data can be obtained, a statistical procedure called **regression analysis** can be used to develop an equation showing how the variables are related.
]

---
# What is regression analysis?
.font140[
* **Regression analysis** is a set of statistical processes for estimating the **relationships** between a **dependent variable (Y)** and one or more **independent variables (X)**.
]

--

.font140[
* **Determining marketing strategy**
    - Predict/Estimate sales **(Y)** based on:
        * Advertising expenditures **(X)**
]

--

.font140[
* **Regression in everything**
    - Predict/Estimate house price **(Y)** based on:
        * house size, location, etc. **(X)**
]

---
# .font80[Houses for sale prices in Ames, Iowa]

.center[
<img src="Zillow-IA-1-2000.PNG" width="650" height="450" >
]
.center[
.font80[
Image of Ames, Iowa by Zillow
]
]
<!---Atlanta, <img src="images/Zillow-GA.png" width="650" height="450" > GA -->

---
# Example: Ames housing data
<!--.font100-->
.font100[
- We are interested in a house with 2,200 square feet. How much should we expect to pay for it?
- What is the cost of adding 1,000 square feet to a house?
]

.center[
<img src="house-data-01.png" width="400" height="400" >
]

---
# Example of a relationship
<!--.font100-->
.font100[
* Appears to be a **linear relationship**: **as size goes up, price goes up**.
]

.center[
<img src="house-data-01.png" width="400" height="400" >
]

---
# What characteristics should we use?
.font140[
* To keep things simple, let us focus only on .blue[size] of the house.
]

--

.font140[
* .blue[Size of house]
    - The information that we use to guide prediction.
]

--

.font140[
* .red[Price of house]
    - The information that we seek to predict.
]

---
# Simple linear regression
.font140[
Simple linear regression is a statistical method that models the relationship between a .red[dependent variable (Y)] and .blue[ an independent variable (X)].

* .red[Y=price of house]
    - The .red[dependent variable] that we seek to predict.

* .blue[X=size of house]
    - The .blue[independent variable] that we use to guide prediction.
]

---
# Simple linear regression

* .font100[Data]: `\(\large \left\{\left(X_i, Y_i\right)\right\}_{i=1}^n\)`

* .font100[Model]: `\(\large Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)`
    - `\(\large Y_i\)` .font100[is a continuous response].
    - `\(\large X_i\)` .font100[is a continuous predictor].
    - `\(\large \beta_0\)` .font100[is the intercept of the regression line].
    - `\(\large \beta_1\)` .font100[is the slope of the regression line].
    - `\(\large \epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)\)`.

---
# Regression analysis: Step 1
.font120[
- **State the problem**
]

--

.font120[
- **Problem**:
    * Examine the relationship between the (selling) price of a home and its size.
    * Predict the price based on observed characteristics.
]

---
# Regression analysis: Step 2
.font120[
- **Data collection: Ames housing data**
]
<!---Atlanta, Data Visualization scatterplots #plot the relationship between Price and Miles or Price and Year -->
.pull-left[
.font90[.purple[**Code**]]
.code60[
```R
# Load ggplot2 for plotting
library(ggplot2)
# For reproducibility
*set.seed(750)
# Load the data (if not already loaded)
*data(ames, package = "modeldata")
# Rows to select at random
*ids <- sample.int(nrow(ames), size = 50)
# Training (or model building) data
*ames.m1 <- ames[ids, ]
# Attach the data frame to the R search path
*attach(ames.m1)
# Rescale response (price in $1,000s)
*Price <- Sale_Price / 1000
# Rescale predictor (size in 1,000s of square feet)
*Size <- Gr_Liv_Area / 1000
ggplot(ames.m1, aes(x = Size, y = Price)) +
  geom_point(size = 6, color = "blue") +
  labs(x = "size", y = "price",
       title = "Ames housing data")
```
]
]

---
# Regression analysis: Step 2
.font120[
- **Data collection: Ames housing data**
]
<!---Atlanta, Data Visualization scatterplots #plot the relationship between Price and Miles or Price and Year -->
.pull-left[
.font90[.purple[**Code**]]
.code60[
```R
# Load ggplot2 for plotting
library(ggplot2)
# For reproducibility
set.seed(750)
# Load the data (if not already loaded)
data(ames, package = "modeldata")
# Rows to select at random
ids <- sample.int(nrow(ames), size = 50)
# Training (or model building) data
ames.m1 <- ames[ids, ]
# Attach the data frame to the R search path
attach(ames.m1)
# Rescale response (price in $1,000s)
Price <- Sale_Price / 1000
# Rescale predictor (size in 1,000s of square feet)
Size <- Gr_Liv_Area / 1000
*ggplot(ames.m1, aes(x = Size, y = Price)) +
*  geom_point(size = 6, color = "blue") +
  labs(x = "size", y = "price",
       title = "Ames housing data")
```
]
]

.pull-right[
<img src="house-data.png" width="350" height="350" >
]

---
# Regression analysis: Step 3
.font120[
- **Model fitting & estimation**
    * Simple linear regression
]

* .font100[Model]: `\(\large Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)`
    - `\(\large Y_i\)` .font100[is a continuous response].
    - `\(\large X_i\)` .font100[is a continuous predictor].
    - `\(\large \beta_0\)` .font100[is the intercept of the regression line].
    - `\(\large \beta_1\)` .font100[is the slope of the regression line].
    - `\(\large \epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)\)`.

---
# Linear prediction
.font120[
* Appears to be a linear relationship: .blue[as size goes up, price goes up].
* Fitting a line by the “eyeball” method:
]

.center[
<img src="house-data-0.png" width="400" height="400" >
]

---
# What is a good line?
</br>
.font160[
.center[.red[Can we do better than the eyeball method?]
]
]

--

.font120[
* We need a strategy for estimating the slope and intercept parameters in the model `\(Y = \beta_0 + \beta_1 X + \epsilon\)`.
]

--

.font120[
That involves

* choosing a .red[criterion], i.e., quantifying how good a line is.
]

--

.font120[
* and matching that with a .blue[solution], i.e., finding the best line subject to that criterion.
]

---
class: clear

.font140[
A reasonable goal is to **minimize** the size of **all residuals**:
]

--

.font120[
- The red solid line gives our .red[predictions or fitted values]: `\(\widehat{Y_i}= \beta_0 + \beta_1 X_i\)`.

- The **residual** `\(r_i = Y_i - \widehat{Y_i}\)` is the vertical distance from the .blue[observed value] to the .red[red solid line].
]

.center[
<img src="LS.png" width="360" height="360" >
]

---
# Least Squares (LS)
.font120[
The most common approach is to use the method of least squares (LS) estimation.
]

--

.font120[
* Least squares chooses `\(\beta_0\)` and `\(\beta_1\)` to **minimize** the residual sum of squares (RSS)

`$$\sum^n_{i=1} r_i^2 = \sum^n_{i=1} (Y_i - \widehat{Y_i}) ^2 = \sum^n_{i=1} (Y_i - [\beta_0 + \beta_1 X_i ]) ^2$$`
]

---
# R's built-in lm() function
.font120[
* The `lm()` function can be used to fit the SLR model.
    - In R, type `?lm` to view the associated documentation/help page.
]

--

.font120[
* The statement `lm(y ~ x, data = df)` fits an SLR model by regressing `y` on `x`, where `y` and `x` are columns in `df`.
]

---
# .font90[R demonstration using Ames housing data]
.font120[
Fit an SLR model to the Ames housing data using `Price` as the dependent variable and `Size` as the independent variable, and interpret the estimated coefficients.
]

.pull-left[
.font90[.purple[**Code**]]
.code80[
```R
# Fit an SLR model to the data
*model1 <- lm(Price ~ Size, data = ames.m1)
summary(model1)
```
]]

--

.pull-right[
.font90[.purple[**Output**]]
.code55[
```R
Call:
lm(formula = Price ~ Size, data = ames.m1)

Residuals:
    Min      1Q  Median      3Q     Max
-95.591 -27.706  -5.042  28.520 174.538

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
*(Intercept)   -22.45      23.33  -0.962    0.341
*Size          137.18      14.96   9.173 3.96e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.36 on 48 degrees of freedom
*Multiple R-squared:  0.6367, Adjusted R-squared:  0.6292
F-statistic: 84.14 on 1 and 48 DF,  p-value: 3.958e-12
```
]]

---
# Interpretation of coefficients
.font120[
The estimated model is:
$$ \widehat{\text{House price}} = -22.45 + 137.18 \times \text{Size}$$
]

--

.font120[
**Slope** is 137.18.

- Price is measured in $1,000s and size in 1,000s of square feet, so as the size of the house increases by 1 square foot, the price of the house increases by $137.18 on average.
]

--

.font120[
**Intercept** is -22.45.

- **Does interpreting the intercept make sense in this problem?**
]

--

.font120[
`\(R^2\)` is 63.7%.

- About 63.7% of the variation in house price is explained by the size of the house.
]

---
class: clear, middle, center

.font200[
How do we minimize `\(RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2\)` ?
]

--

</br>
.font250.blue[
Calculus!
]

---
# Need to solve a system of two equations
</br>

`\begin{align} \frac{\partial RSS}{\partial \beta_0} &= -2n\left(\bar{Y} - \beta_0 - \beta_1\bar{X}\right) = 0 \end{align}`

`\begin{align} \frac{\partial RSS}{\partial \beta_1} &= -2\left(\sum_{i=1}^nX_iY_i - n\beta_0\bar{X} - \beta_1\sum_{i=1}^nX_i^2\right) = 0 \end{align}`

---
# Derive the LS estimate of `\(\beta_1\)`

`\begin{align} \frac{\partial RSS}{\partial \beta_1} &= -2\left(\sum_{i=1}^nX_iY_i - n\beta_0\bar{X} - \beta_1\sum_{i=1}^nX_i^2\right) = 0 \Rightarrow \end{align}`

Substituting `\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)` (from the first equation) gives:

`\begin{align} \sum_{i=1}^nX_iY_i - n (\bar{Y} - \hat{\beta}_1 \bar{X}) \bar{X} - \hat{\beta}_1\sum_{i=1}^nX_i^2 &= 0 \Rightarrow\\ \sum_{i=1}^n \left( X_iY_i -X_i\bar{Y} + \hat{\beta}_1 \bar{X}X_i - \hat{\beta}_1 X_i^2 \right)&= 0 \Rightarrow\\ \hat{\beta}_1 = \frac{\sum_{i=1}^n\left(X_iY_i -X_i\bar{Y}\right) }{\sum_{i=1}^n \left(X_i^2 - \bar{X}X_i\right) } \Rightarrow\\ \color{blue}{\hat{\beta}_1} = \frac{\sum_{i=1}^n\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^n\left(X_i - \bar{X}\right)^2} = \frac{S_{XY}}{ S_{XX}}= \color{blue}{r \frac{S_Y}{S_X}} \end{align}`

---
# Derive the LS estimate of `\(\beta_0\)`
</br>

`\begin{align} \frac{\partial RSS}{\partial \beta_0} &= -2n\left(\bar{Y} - \beta_0 - \beta_1\bar{X}\right) = 0 \Rightarrow \\ \bar{Y} &- \hat{\beta}_0 - \hat{\beta}_1\bar{X} =0 \Rightarrow \\ \color{blue}{\hat{\beta}_0} & = \color{blue}{\bar{Y} - \hat{\beta}_1 \bar{X}} \end{align}`

---
# Making predictions
.font120[
- Predict the price for a house with 2,200 square feet:

`\begin{align} \widehat{\text{House price}} &= -22.45 + 137.18 \times \text{Size}\\ &= -22.45 + 137.18 \times 2.2 \\ &= 279.35 \end{align}`

- The predicted price for a house with 2,200 square feet is 279.35 (in $1,000s), i.e., $279,350.
]

.font120[
- **When using a regression model for prediction, only predict within the relevant range of data.**
]

---
# Making predictions
<!-- ```r head(cbind("actual values"=ames.m1$Price, "fitted values" = fitted(model1)),5) ## actual values fitted values ## 1 181.0 195.0774 ## 2 124.0 192.5470 ## 3 72.0 95.7573 ## 4 137.0 147.6315 ## 5 124.5 101.4508 head(ames.m1$Size,1) ## [1] 1.604 ``` -->
.font120[
- Predict the price for a house with 2,200 square feet:

`\begin{align} \widehat{\text{House price}} &= -22.45 + 137.18 \times \text{Size}\\ &= -22.45 + 137.18 \times 2.2 \\ &= 279.35 \end{align}`

- The predicted price for a house with 2,200 square feet is 279.35 (in $1,000s), i.e., $279,350.
]

--

.pull-left[
.font90[.purple[**Code**]]
.code80[
```R
# Size is in 1,000s of square feet, so 2.2 means 2,200 square feet
*new_data <- data.frame(Size = c(2.2))
*predictions <- predict(model1, new_data)
predictions
```
]]

.pull-right[
.font90[.purple[**Output**]]
.code80[
```R
predictions
      1
* 279.35
```
]]

---
# .font80[Houses for sale prices in Ames, Iowa]

.center[
<img src="Zillow-IA-1-2000.PNG" width="650" height="450" >
]
.center[
.font80[
Image of Ames, Iowa by Zillow
]
]

---
# .font80[Houses for sale prices in Ames, Iowa]

.center[
<img src="Zillow-IA-1-2000-3.PNG" width="650" height="450" >
]
.center[
.font80[
Image of Ames, Iowa by Zillow
]
]
<!---Atlanta, <img src="images/Zillow-GA.png" width="650" height="450" > GA -->

---
# .font80[Houses for sale prices in Ames, Iowa]

.center[
<img src="Zillow-IA-2-2000.PNG" width="650" height="450" >
]
.center[
.font80[
Image of Ames, Iowa by Zillow
]
]

---
# Why are the prices different?
</br>
.font140[
* Many factors or variables affect the price of a house
    - size of house
    - number of baths
    - garage
    - size of land
    - location, etc.
]

--

- .font150[ Multiple linear regression]: `\(\LARGE Y = \beta_0 + \sum_{i=1}^p \beta_i X_i + \epsilon\)`

---
class: clear,middle,center

</br>
.font220[
Thank you!
]

</br>
.font120[
Zhaohu (Jonathan) Fan

PhD Candidate in Business Analytics

fanzh@ucmail.uc.edu
]

---
# Steps in a regression analysis
.font100[
- Step 1. State the problem
]
.font100[
- Step 2. Data collection (more details next class)
]
.font100[
- Step 3. Model fitting & estimation (this class)
    * Model specification (linear? logistic? next class)
    * Model fitting (least squares)
    * Select potentially relevant variables (next class)
    * Model validation and criticism (next class)
    * Back to 3.1? Back to 2?
]

---
# Assumptions of regression (L.I.N.E)
.font120[
* **L**inearity
    - The relationship between X and Y is linear.
]

--

.font120[
* **I**ndependence of Errors
    - Error values are statistically independent.
    - Particularly important when data are collected over a period of time.
]

--

.font120[
* **N**ormality of Errors
    - Error values are normally distributed for any given value of X.
]

--

.font120[
* **E**qual Variance (also called homoscedasticity)
    - The probability distribution of the errors has constant variance.
]

---
# Assumptions of regression (L.I.N.E)
.font90[.purple[**Code**]]
.code60[
```R
*plot(model1)
```
]

.center[
<img src="model-check.png" width="450" height="450" >
]

---
# .font90[More examples of statistical relationships]
</br>

- .font150[ Simple linear regression]: `\(\LARGE Y = \beta_0 + \beta_1 X + \epsilon\)`

- .font150[ Multiple linear regression]: `\(\LARGE Y = \beta_0 + \sum_{i=1}^p \beta_i X_i + \epsilon\)`

- .font150[Polynomial regression]: `\(\LARGE Y = \beta_0 + \sum_{i=1}^p \beta_i X^i + \epsilon\)`

- .font150[Nonlinear regression]: `\(\LARGE Y = \frac{\beta_1 X}{\left(\beta_2 + X\right)} + \epsilon\)`

- .font150[ and more.]
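
---
# .font90[Checking the LS estimates numerically]
.font100[
The closed-form least squares estimates derived earlier, `\(\hat{\beta}_1 = S_{XY}/S_{XX}\)` and `\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)`, can be checked against `lm()`. The following is a minimal sketch and not part of the Ames analysis: it assumes simulated data, and the names `x`, `y`, `b0`, `b1`, `b1_alt`, and `fit`, the sample size, and the "true" coefficients are hypothetical choices made only for illustration.
]

.code60[
```R
# Illustrative sketch with simulated data (an assumption; not the Ames sample)
set.seed(1)
n <- 50
x <- runif(n, min = 0.5, max = 3)        # hypothetical sizes (1,000s of sq ft)
y <- -20 + 135 * x + rnorm(n, sd = 45)   # hypothetical prices ($1,000s)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # S_XY / S_XX
b0 <- mean(y) - b1 * mean(x)                                     # Ybar - b1 * Xbar

# Equivalent form of the slope: r * S_Y / S_X
b1_alt <- cor(x, y) * sd(y) / sd(x)

# Compare the hand-computed estimates with lm()
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = coef(fit))
```
]

.font100[
If the formulas are applied correctly, the two rows of the comparison should agree up to rounding, and `b1_alt` should match `b1`.
]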